Section: New Results

Improving Characterization of NUMA Architectures through Applications' Kernels

Participants : Philippe Virouleau, François Broquedis, Thierry Gautier, Julien Langou [UCD, USA], Fabrice Rastello.

Programmers need tools to study the behavior of their applications. When targeting NUMA architectures, many existing tools make it possible to observe an application and identify its critical parts. However, there is a need for tools that help programmers understand precisely how those critical parts behave, and how they could be improved on a given architecture.

In the context of data-flow applications, each part of the application (a task) is clearly identified in the data-flow graph. All manipulated data are also explicitly available since, within such a framework, they constitute the links between tasks.
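
The report does not tie this to one programming model; as a minimal, hypothetical illustration, here is how such a data-flow task graph can be expressed with OpenMP depend clauses, one common way of writing these applications (the variable names are purely illustrative):

#include <stdio.h>

#define N 4

int main(void)
{
    double a[N], b[N], c[N];

    #pragma omp parallel
    #pragma omp single
    {
        /* Each task declares the data it reads and writes; the runtime
         * derives the data-flow graph from these dependences. */
        #pragma omp task depend(out: a)
        for (int i = 0; i < N; i++) a[i] = i;

        #pragma omp task depend(out: b)
        for (int i = 0; i < N; i++) b[i] = 2.0 * i;

        /* This task becomes ready only once both producers complete. */
        #pragma omp task depend(in: a, b) depend(out: c)
        for (int i = 0; i < N; i++) c[i] = a[i] + b[i];
    }

    printf("c[%d] = %g\n", N - 1, c[N - 1]);
    return 0;
}

Here the data (a, b, c) are exactly what links the producer tasks to the consumer, as described above.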

On NUMA architectures, a task's execution time depends, among other factors, on both the core that executes the task and the NUMA node on which its data has been allocated. Assume one can characterize a task's behavior with regard to its execution context as follows: run it in isolation from the rest of the application, and vary some of its properties (such as the size of its input or the placement of its data). The scheduler of a runtime system can then use this characterization to improve overall performance: since it has full knowledge of what is currently running on the machine (for instance, on the same NUMA node as an idle thread), it can sort the tasks ready for execution according to how well each would behave on that idle thread, given the current state.
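
As a sketch of what such a characterization run might look like, the following C fragment times one kernel in isolation while varying both the NUMA node holding its data and the node executing it, using libnuma (kernel_run is a hypothetical stand-in for the kernel under study; link with -lnuma):

#include <numa.h>
#include <stdio.h>
#include <time.h>

void kernel_run(double *data, size_t n); /* hypothetical kernel under study */

static double elapsed_s(struct timespec t0, struct timespec t1)
{
    return (double)(t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
}

void characterize(size_t n)
{
    if (numa_available() < 0)
        return;
    int nodes = numa_max_node() + 1;

    for (int data_node = 0; data_node < nodes; data_node++) {
        for (int exec_node = 0; exec_node < nodes; exec_node++) {
            /* Place the data on data_node, run on exec_node. */
            double *data = numa_alloc_onnode(n * sizeof *data, data_node);
            numa_run_on_node(exec_node);
            for (size_t i = 0; i < n; i++)
                data[i] = 0.0; /* touch pages; the policy keeps them on data_node */

            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            kernel_run(data, n);
            clock_gettime(CLOCK_MONOTONIC, &t1);

            printf("data@%d exec@%d size=%zu time=%.6f s\n",
                   data_node, exec_node, n, elapsed_s(t0, t1));
            numa_free(data, n * sizeof *data);
        }
    }
}

Sweeping the full (data node, execution node, input size) grid by hand quickly becomes tedious; this is exactly the kind of scenario the tool described below automates.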

We designed a tool whose goal is to help the user execute a given scenario on the architecture. Such a scenario typically describes which kernels to run, the size and placement of their data, and the concurrent workload running alongside them.

The tool guarantees that the scenario will be executed correctly on the architecture, letting users focus on understanding their application rather than on low-level implementation details.
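
To give a flavor of the low-level details being abstracted away (the tool's actual mechanism is not described here; this sketch is an assumption for illustration), binding the calling thread to a given core with hwloc looks like this:

#include <hwloc.h>

/* Pin the calling thread to core core_idx; returns 0 on success.
 * Link with -lhwloc. */
int bind_to_core(int core_idx)
{
    hwloc_topology_t topo;
    if (hwloc_topology_init(&topo) < 0)
        return -1;
    if (hwloc_topology_load(topo) < 0) {
        hwloc_topology_destroy(topo);
        return -1;
    }

    hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE,
                                             (unsigned)core_idx);
    int rc = core ? hwloc_set_cpubind(topo, core->cpuset,
                                      HWLOC_CPUBIND_THREAD)
                  : -1;

    hwloc_topology_destroy(topo);
    return rc;
}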

We applied this approach to a dense linear algebra algorithm: the Cholesky factorization. It enabled us to profile the four kernels of the application by running them in various configurations of data placement, input size, and concurrent workload. We believe we have tested enough configurations to reliably identify the best and worst cases for all the kernels. Assuming the kernels behave the same within the full application, we were able to derive upper and lower bounds on the execution time of the overall application from those best and worst cases.
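
For reference, a tiled Cholesky factorization is classically decomposed into four tile kernels, POTRF, TRSM, SYRK, and GEMM (the report does not name them, so this decomposition is assumed here). The sketch below shows how its task graph can be written with OpenMP dependences, using hypothetical wrappers (potrf_tile, etc.) around the corresponding BLAS/LAPACK routines:

#define NT 8 /* number of tile rows/columns, illustrative */

/* Hypothetical wrappers around the BLAS/LAPACK tile kernels. */
void potrf_tile(double *akk);
void trsm_tile(const double *akk, double *aik);
void syrk_tile(const double *aik, double *aii);
void gemm_tile(const double *aik, const double *ajk, double *aij);

/* A[i][j] points to the tile in block row i, block column j. */
void cholesky_tiled(double *A[NT][NT])
{
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < NT; k++) {
        #pragma omp task depend(inout: A[k][k])
        potrf_tile(A[k][k]); /* factor the diagonal tile */

        for (int i = k + 1; i < NT; i++) {
            #pragma omp task depend(in: A[k][k]) depend(inout: A[i][k])
            trsm_tile(A[k][k], A[i][k]); /* solve the panel tile */
        }

        for (int i = k + 1; i < NT; i++) {
            #pragma omp task depend(in: A[i][k]) depend(inout: A[i][i])
            syrk_tile(A[i][k], A[i][i]); /* update the diagonal tile */

            for (int j = k + 1; j < i; j++) {
                #pragma omp task depend(in: A[i][k], A[j][k]) depend(inout: A[i][j])
                gemm_tile(A[i][k], A[j][k], A[i][j]); /* update tile (i,j) */
            }
        }
    }
}

Each of the four task types corresponds to one of the profiled kernels, so per-kernel best-case and worst-case timings translate directly into bounds on the execution time of the whole task graph.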